Abstract

In  this  paper  we  introduce  the  notion  of  content  locality  in  distributed  document  collections.  Content 
locality  is  the  degree  to  which  content-similar  documents  are  colocated  in  a  distributed  collection.  We 
propose  two  metrics  for  measurement  of  content  locality,  one  based  on  topic  signatures  and  the  other 
based  on  collection  statistics.  We  provide  derivations  and  analysis  of  both  metrics  and  use  them  to 
measure  the  content  locality  in  two  kinds  of  document  collections,  the  well-known  TREC  corpus  and 
the  Networked  Computer  Science  Technical  Report  Library  (NCSTRL),  an  operational  digital  library. 
We  also  show  that  content  locality  can  be  thought  of  temporally  as  well  as  spatially  and  provide 
evidence  of  its  existence  in  temporally  ordered  document  collections  like  news  feeds.  ©  1999  Elsevier 
Science  Ltd.  All  rights  reserved. 


1.  Introduction 

Successful  design,  testing  and  deployment  of  digital  libraries  involves  research  in  a  variety  of 
disciplines,  including  information  retrieval,  databases,  collection  development,  archival  policies, 
human  computer  interaction,  intellectual  property  and  commerce  models  to  name  just  a  few. 
Here  we  consider  the  digital  library  (DL)  as  a  set  of  autonomous,  distinct  document  collections 
that  ‘cooperate’  to  support  search  and  retrieval.  In  this  distributed  setting,  we  expect  that  the 
topical  distribution  of  content  among  collections  (sites)  in  the  system  will  be  non-uniform.  For 
example,  in  a  DL  of  the  works  of  contemporary  literature  of  the  American  South,  we  would 
expect  that  the  materials  of  William  Styron  would  reside  in  large  part  at  Duke  University,  his 
alma  mater,  rather  than  be  distributed  uniformly  throughout  all  member  collections  in  the  DL. 
A major finding in previous work in distributed information retrieval (Viles & French, 1995;
French  &  Viles,  1996;  Viles,  1996)  is  that  such  content-based  allocation  of  documents  to  sites 
affects  the  quality  of  retrieval.  When  there  is  no  inter-site  communication,  distributed 
document  collections  whose  content  is  heavily  skewed  exhibit  much  poorer  retrieval 
effectiveness  than  collections  whose  content  is  uniformly  distributed. 

We refer to this phenomenon, previously called ‘content skew’ (Viles, 1996), as content-locality. Intuitively, content-locality is the degree to which topically similar documents are co-located in a distributed document collection. In this paper, we

•  provide  two  methods  for  measuring  content-locality,  one  topically  based  and  the  other 
statistically  based. 

•  measure  the  locality  of  the  multi-year,  multi-source  TREC  collection  using  the  topically 
based  metric. 

•  measure  the  locality  of  an  operational  distributed  digital  library,  the  Networked  Computer 
Science  Technical  Report  Library  (NCSTRL,  http://www.ncstrl.org/)  (Davis,  1995)  using  the 
statistically  based  metric. 

•  show  that  there  is  also  a  temporal  analogue  to  the  spatial  content-locality  documented  here. 

The  primary  goal  of  this  paper  is  to  describe  the  nature  of  content-locality  and  quantify  its 
presence  in  distributed  document  collections.  To  do  so,  we  define  methods  to  measure  content- 
locality  and  measure  the  locality  of  two  distributed  document  collections,  one  constructed  for 
IR  experimentation  and  one  that  is  an  operational  digital  library.  We  provide  a  critical  analysis 
of  each  of  the  proposed  methods  with  the  goal  of  gaining  additional  insight  into  the  underlying 
phenomenon. 

The  first  measure  we  give  here  is  topic-centric.  Essentially,  we  treat  a  document  collection  as 
a  set  of  topics  and  determine  how  each  topic  is  allocated  to  the  member  sites  of  the  distributed 
collection. The more asymmetric this allocation, the higher the locality; the more uniform, the lower the locality. The second measure is statistically based. For each site in the distributed collection, we look at the distribution of terms in the local collection and compare it against what it would be in a centralized collection composed of the contents of all local collections.

The  notion  of  content-locality  in  distributed  document  collections  is  new.  Each  of  the  two 
possible  metrics  we  propose  here  has  its  advantages.  The  topically  based  metric  gives  insight 
into  the  distribution  of  content  by  topic  in  the  collection.  However,  the  fidelity  of  the 
measurement  depends  upon  the  accuracy  of  topic  determination.  The  statistically  based  metric 
is  simple  to  calculate  and  requires  no  subjective  determination  of  topic,  thus  it  holds  promise 
for use in operational systems. For both metrics there is an interpretation and scaling issue: if locality is measured as 0.186, for example, does that represent a skewed or an unskewed system?
These  problems  can  only  be  overcome  once  distributed  document  archive  systems  have  been 
deployed  and  analyzed  in  realistic  environments. 


2.  Why  measure  content-locality? 


If  we  can  determine  that  the  content-locality  of  a  distributed  collection  is  low,  then  the 
implication from an engineering perspective is that no inter-site communication is needed in
order  to  attain  good  search  effectiveness  (Viles  &  French,  1995;  French  &  Viles,  1996).  This  is 
desirable,  because  it  implies  that  a  site  can  operate  more  or  less  independently  and  allows 
much  more  flexibility  in  the  particulars  of  things  like  index  structure  (e.g.  whether  or  not 
collection  statistics  are  pre-computed  and  stored  in  the  index)  and  communication 
infrastructure  (e.g.  whether  or  not  a  site  must  handle  collection  statistic  updates  originating 
‘off-site’). Conversely, if the content-locality of a collection is high then some kind of inter-site
communication  is  needed  to  achieve  search  quality  commensurate  with  a  centralized  system. 
Engineering  the  system  in  this  situation  then  involves  an  informed  trade-off  between  the  ‘best’ 
search  quality  and  a  simpler,  more  efficient  system.  On  the  other  hand,  it  is  also  possible  to 
exploit  highly  content-localized  systems  by  quickly  eliminating  or  deferring  search  at  sites  with 
low  topical  relevance. 

The determination of content-locality itself requires inter-site communication, but it is of a different kind than what would be used operationally when sites exchange collection statistics like document frequency or document lengths. In the latter case, the communication may require the expensive update of disk-based structures at each recipient and each site could potentially be receiving from every other site. In the former case, each site need send only its topical or statistical descriptions to some process or entity that has the ability to integrate them into a single unified description. The availability of such descriptions is commonly assumed in
much  of  the  literature  on  collection  selection  (also  called  ‘database  selection’)  (Callan,  Lu  & 
Croft,  1995;  Gravano,  Chang,  Paepcke  &  Garcia-Molina,  1997;  Gravano  &  Garcia-Molina, 
1997). 

As  has  been  mentioned,  the  topical  locality  measurement  requires  a  set  of  topics  that,  taken 
together,  define  the  content  of  the  distributed  collection.  The  set  of  topics  itself  is  useful  in  a 
variety  of  ways,  so  in  some  respect  it  is  equally  appropriate  to  think  of  locality  as  one  of  many 
reasons  to  undertake  the  topical  determination  of  a  document  collection.  Once  the  topical 
content  is  determined,  locality  is  simple  to  calculate. 

In  addition  to  the  determination  of  content-locality,  there  are  many  potential  benefits  of 
knowing  the  topical  content  of  a  collection.  These  include: 


Topic   Text Excerpt

161     Document will provide information on the problems and actions associated
        with what is known as acid rain.

152     Accusations of Cheating by Contractors on U.S. Defense Projects. Document
        will refer to an alleged illegality committed by any entity seeking a contract
        on behalf of the U.S. Military Forces.

37      Document identifies software products which adhere to IBM’s SAA standards.
        To be relevant, a document must identify a piece of software which is
        considered a Systems Application Architectural (SAA) component or one
        which conforms to SAA.

Fig. 1. Three topics from the TREC collection that exhibit different degrees of topical locality. Topics are arranged from ‘low’ (topic 161) to ‘high’ (topic 37) locality.

Table 1
Summary of notation used in content-locality derivation

Symbol            Meaning
$b_{s,t}$         $n_{s,t}/n_t$, proportional size of topic t
$c_s$             $N_s/N$, proportional size of site s
$n_t$             size of topic t
$n_{s,t}$         size of topic t at site s
$N$               collection size
$N_s$             collection size at site s
$S$               number of sites
$\mathcal{T}$     set of topics
$T$               number of topics
$\delta_s$        statistical locality for site s
$\delta$          statistical locality for system
$\sigma_{s,t}$    locality for topic t at site s
$\sigma_t$        locality for topic t
$\sigma$          topical locality for system

•  Topic  tracking:  looking  for  topics  that  are  new,  ‘hot’,  or  have  very  little  or  very  much 
activity. 

•  Efficiency:  caching  documents  in  the  same  topic  together  or  nearby. 

•  Intelligent  browsing:  in  interactive  systems,  a  topical  map  can  provide  a  set  of  related 
starting  points  for  browsing  (Cutting,  Karger,  Pedersen,  &  Tukey,  1992;  Kellogg  &  Subhas, 
1996). 

•  Pre-fetching:  the  fetching  of  one  or  more  documents  in  the  same  topic  might  signal  the 
system  to  start  pre-fetching  other  documents  in  the  topic. 


3.  Topic-based  locality 

Before  presenting  the  details  of  the  topical  locality  measure,  we  present  three  topics  from  the 
TREC  experiments  (Harman,  1995)  in  Fig.  1.  When  considered  in  the  context  of  the 
heterogeneous  mix  of  sources  that  comprise  the  TREC  corpus,  these  topics  provide  further 
intuition into the nature of locality. Later (Section 7), we provide specific locality
measurements  for  250  of  the  TREC  topics.  In  the  case  of  this  example,  the  topics  exhibited 
‘low’  (topic  161),  ‘average’  (topic  152)  and  ‘high’  (topic  37)  topical  locality. 

The  notation  we  use  in  this  paper  is  given  in  Table  1. 

The  locality  measure  we  define  in  this  section  is  topic-centered.  To  determine  locality  for  the 
entire  system,  we  first  determine  locality  for  each  individual  topic  and  then  combine  these  to 
get  a  system  level  measurement  of  content-locality. 

3.1.  Individual  topic  locality 

Intuitively, there are two contributing factors to topic-based content locality. First, given some topic t, the total number of sites k that contain some member of t affects locality. If k is small, then locality should be high; if k is large, then locality should be low. Second, given that t is represented at k sites, the more asymmetric the distribution of members of t at the k sites, the more content-localized the system is.

The  general  approach  we  take  is  to  define  content-locality  for  a  topic  as  the  sum  of  squared 
error  terms  where  ‘error’  is  the  distance  from  a  content-uniform  system,  one  where  a  topic  is 
equally  represented  at  all  sites. 

The locality for topic t at site s is denoted by $\sigma_{s,t}$ and is calculated by

$$\sigma_{s,t} = \left(b_{s,t} - E[b_{s,t}]\right)^2, \tag{1}$$

where $b_{s,t} = n_{s,t}/n_t$ is the size $n_{s,t}$ of the topic at the site relative to the overall size of that topic. Here we think of $E[b_{s,t}]$ as the expected value of $b_{s,t}$ when content is uniformly distributed throughout the distributed collection. In content-uniform collections, we would expect that $E[b_{s,t}]$ would track the proportionate size of the collection at site s. If $c_s = N_s/N$ is the proportionate size, then

$$\forall t, \quad E[b_{s,t}] = c_s$$

so by substitution Eq. (1) becomes

$$\sigma_{s,t} = (b_{s,t} - c_s)^2.$$

Locality for some topic t is denoted $\sigma_t$ and is determined by summing the locality for that topic at each site and taking the square root. So

$$\sigma_t = \sqrt{\sum_{s=1}^{S} \sigma_{s,t}}$$

and by substitution

$$\sigma_t = \sqrt{\sum_{s=1}^{S} (b_{s,t} - c_s)^2}.$$

In Appendix 1, we show that for any distributed collection, $0 \le \sigma_t \le \sqrt{2}$, and for collections where each site is approximately the same size, $0 \le \sigma_t \le 1$.
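As an illustration of the definition above, the following minimal Python sketch computes the topic locality $\sigma_t$ from per-site document counts. The function name `topic_locality` and the example counts are assumptions made for illustration; they do not come from the experiments reported in this paper.

```python
from math import sqrt

def topic_locality(topic_counts, site_sizes):
    """Return sigma_t for one topic.

    topic_counts[s] is n_{s,t}, the number of documents of the topic at site s.
    site_sizes[s] is N_s, the total number of documents at site s.
    """
    if len(topic_counts) != len(site_sizes):
        raise ValueError("one count and one size are needed per site")
    n_t = sum(topic_counts)   # size of the topic
    N = sum(site_sizes)       # size of the whole collection
    if n_t == 0 or N == 0:
        raise ValueError("topic and collection must be non-empty")
    total = 0.0
    for n_st, N_s in zip(topic_counts, site_sizes):
        b_st = n_st / n_t     # share of the topic held by this site
        c_s = N_s / N         # share of the collection held by this site
        total += (b_st - c_s) ** 2
    return sqrt(total)

# Hypothetical three-site collection with sites of unequal size.
uniform = topic_locality([10, 20, 30], [100, 200, 300])  # topic tracks site size
skewed = topic_locality([60, 0, 0], [100, 200, 300])     # topic held by smallest site
print(uniform, skewed)
```

In the first call the topic is spread in proportion to site size, so the measured locality is zero. In the second call the whole topic sits at the smallest site and the value exceeds 1, which is possible only because the sites differ in size; it remains below the general bound of $\sqrt{2}$.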

3.1.1. Behavior of $\sigma_t$

We  would  like  the  measure  of  content-locality  to  reflect  the  two  contributing  factors  to 
locality  that  we  outlined  previously,  namely: 

1.  As  fewer  (more)  sites  contain  members  of  some  topic  t,  measured  locality  should  increase 
(decrease). 

2.  Given  that  k  sites  contain  members  of  t,  the  more  asymmetric  the  distribution  of  these 
members,  the  higher  measured  locality  should  be. 

In Appendix 1 we show that property (1) holds. Specifically, if we consider the system where a topic is evenly distributed over k sites, then if all sites have about the same number of documents

$$\sigma_t = \sqrt{1/k - 1/S}.$$

As $k \to S$, $\sigma_t \to 0$.

Now consider property (2). The measured locality for a topic that has members at k sites should become higher as the distribution becomes more non-uniform or asymmetric. Our locality measure has this desirable property as well. Suppose topic t has members at 3 of S sites. As before, $c_s = 1/S$. When the distribution of t over those three sites is [1/3, 1/3, 1/3], $\sigma_t = \sqrt{1/3 - 1/S}$. When it is [1/2, 1/4, 1/4], $\sigma_t$ increases to $\sqrt{3/8 - 1/S}$. Table 2 shows $\sigma_t$ as the topic distribution changes from [1/3, 1/3, 1/3] to [4/5, 1/10, 1/10].
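The behavior described above can be checked numerically. The sketch below is illustrative only: it assumes equal-sized sites ($c_s = 1/S$), uses S = 20 as in Table 2, and the helper name `locality_equal_sites` is hypothetical. It evaluates the definition directly and compares it with the closed form $\sqrt{\sum_s b_{s,t}^2 - 1/S}$ that follows from expanding the squares.

```python
from math import sqrt

def locality_equal_sites(shares, S):
    """sigma_t when all S sites are the same size (c_s = 1/S).

    shares lists b_{s,t} for the sites that hold the topic; every other
    site holds none of it and contributes (0 - 1/S)^2.
    """
    padded = list(shares) + [0.0] * (S - len(shares))
    return sqrt(sum((b - 1.0 / S) ** 2 for b in padded))

S = 20
distributions = [
    [1/3, 1/3, 1/3],
    [1/2, 1/4, 1/4],
    [3/5, 1/5, 1/5],
    [2/3, 1/6, 1/6],
    [4/5, 1/10, 1/10],
]

# Direct evaluation of the definition.
computed = [locality_equal_sites(shares, S) for shares in distributions]
# Closed form obtained by expanding the sum of squares.
closed_form = [sqrt(sum(b * b for b in shares) - 1.0 / S) for shares in distributions]

for shares, value in zip(distributions, computed):
    print([round(b, 3) for b in shares], round(value, 3))
```

The printed values increase monotonically as the distribution becomes more lopsided, which is property (2).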

3.1.2. Small topics and $\sigma_t$

Consider a topic u with a single member. Since there is only one document, only one site, say j, can have it, so

$$\sigma_u = \sqrt{(1 - c_j)^2 + \sum_{s=1,\, s \ne j}^{S} c_s^2}.$$

If as before we assume $c_s = 1/S$ then

$$\sigma_u = \sqrt{(1 - 1/S)^2 + (S - 1)(1/S)^2} = \sqrt{(1 - 2/S + 1/S^2) + (S - 1)(1/S)^2} = \sqrt{1 - 1/S}.$$

If S is reasonably sized (>20), then locality for this topic is both high and fixed. It is not possible to ‘de-localize’ single member topics. A similar kind of analysis can be made for topics of size m where $m \ll S$. The importance of this analysis is that if there are many small topics, they must be properly accounted for so as not to bias the overall system locality measurement.


3.2.  System  locality 

Given that we have a method to calculate locality on a topic-by-topic basis, the next problem is to identify the proper method to combine a set of topical locality measurements into a single measurement representing the content-locality of the entire system.

Table 2
Topic locality for a topic distributed over 3 of S sites as the distribution changes from uniform to heavily uni-modal

Distribution          $\sigma_t$                 $\sigma_t$, S = 20
[1/3, 1/3, 1/3]       $\sqrt{1/3 - 1/S}$         0.532
[1/2, 1/4, 1/4]       $\sqrt{3/8 - 1/S}$         0.570
[3/5, 1/5, 1/5]       $\sqrt{11/25 - 1/S}$       0.624
[2/3, 1/6, 1/6]       $\sqrt{1/2 - 1/S}$         0.671
[4/5, 1/10, 1/10]     $\sqrt{33/50 - 1/S}$       0.781

In Appendix 1 we show that in general $\sigma_t$ is bounded between 0 and $\sqrt{2}$ and under equal-sized site assumptions, it is bounded between 0 and 1. For ease of interpretation, we would like the range of the system locality measure to track the range of the topical measure. This suggests the general approach of averaging some or all of the calculated topic locality measurements to get $\sigma$. Thus

$$\sigma = \sum_{t \in \mathcal{T}'} z_t \sigma_t$$

where

$$\sum_{t \in \mathcal{T}'} z_t = 1, \quad z_t \ge 0.$$

The set $\mathcal{T}'$ is the group of topics to be included in the calculation, where $\mathcal{T}' \subseteq \mathcal{T}$ and $z_t$ is a constant that reflects the contribution of each topic to system locality. We consider three variations on setting the values for $z_t$. The first variation, called Equal, treats the contribution of each topic equally by setting $z_t = 1/|\mathcal{T}'|$. The second method, called Weighted, weights the contribution of a topic according to its overall size, so in this case $z_t = n_t / \sum_{t' \in \mathcal{T}'} n_{t'}$. The third, Sparse, eliminates the contribution of small topics by setting $z_t = 0$ if $n_t$ is less than some threshold and giving topics equal weight if $n_t$ is above the threshold.

The rationale behind the second and third methods is two-fold. As we illustrated in Section 3.1, small topics cannot really exhibit content-uniformity. They are inherently localized. The presence of small topics can give a positive bias to a system locality measurement. The other reason is that the locality measure should accurately reflect the ‘strength’ of each topic. Weighting each topic equally does not accomplish this.

Document Counts

Topic    S1    S2    S3    S4    $n_t$
1        0     1     1     0     2
2        4     2     2     2     10
3        2     2     2     1     7
4        0     0     0     1     1

Site Sizes

         S1      S2      S3      S4
$N_s$    6       5       5       4
$c_s$    0.30    0.25    0.25    0.20

Content Locality

t    $\sigma_{1,t}$    $\sigma_{2,t}$    $\sigma_{3,t}$    $\sigma_{4,t}$    $\sigma_t$
1    0.090             0.063             0.063             0.040             0.505
2    0.010             0.003             0.003             0                 0.122
3    < 0.001           0.001             0.001             0.003             0.078
4    0.090             0.063             0.063             0.640             0.925

System locality: $\sigma_{Equal}$ = 0.408, $\sigma_{Weighted}$ = 0.185, $\sigma_{Sparse}$ = 0.100

Fig. 2. An example calculation of content-locality using a collection with four topics spread over four sites. The Document Counts table shows the distribution of documents at the four sites. The size of each site is given in the Site Sizes table. The Content Locality table gives locality individually for each (topic, site) combination, for each topic, and for the three methods of calculating system locality. $\sigma_{Sparse}$ was calculated by eliminating the two smallest topics, 1 and 4.

A  concrete  example  of  these  three  variations  is  provided  in  Fig.  2.  In  this  example  we  show 
a  four  site  system  with  four  topics.  Each  site  is  about  the  same  size,  but  the  topic  sizes  vary 
greatly.  The  last  three  columns  show  system  locality  as  measured  by  the  Equal,  Weighted  and 
Sparse  methods,  respectively.  Topics  2  and  3  in  Fig.  2  have  very  low  topical  locality  and 
together  make  up  85%  of  the  document  collection.  By  weighting  all  topics  equally,  the 
measured  locality  comes  out  much  higher  than  the  two  methods  that  minimize  the  contribution 
of  small  topics.  However,  simply  throwing  out  these  topics  seems  too  drastic  since  they  are 
part  of  the  collection.  The  Weighted  method  is  a  reasonable  compromise  between  these  two 
extremes  and  for  this  reason  is  the  method  of  choice. 
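The example of Fig. 2 can be reproduced with a few lines of Python. This is a sketch under stated assumptions: site sizes are taken to be the column sums of the document counts (which matches the sizes given in the figure), and the Sparse threshold `MIN_TOPIC_SIZE = 5` is a hypothetical value chosen only so that topics 1 and 4 are dropped, as the figure caption describes.

```python
from math import sqrt

# Document counts n_{s,t} from Fig. 2: topic -> counts at sites S1..S4.
counts = {
    1: [0, 1, 1, 0],
    2: [4, 2, 2, 2],
    3: [2, 2, 2, 1],
    4: [0, 0, 0, 1],
}

# Site sizes N_s as column sums, giving [6, 5, 5, 4] as in the figure.
site_sizes = [sum(column) for column in zip(*counts.values())]
N = sum(site_sizes)
c = [N_s / N for N_s in site_sizes]              # proportional site sizes c_s
topic_sizes = {t: sum(row) for t, row in counts.items()}

def sigma_t(row):
    """Topic locality: square root of the summed squared deviations."""
    n_t = sum(row)
    return sqrt(sum((n_st / n_t - c_s) ** 2 for n_st, c_s in zip(row, c)))

sigma = {t: sigma_t(row) for t, row in counts.items()}

def system_locality(weights):
    """Weighted combination of topic localities; weights must sum to 1."""
    return sum(z * sigma[t] for t, z in weights.items())

# Equal: every topic contributes the same amount.
equal_weights = {t: 1 / len(counts) for t in counts}
# Weighted: contribution proportional to topic size.
total_size = sum(topic_sizes.values())
weighted_weights = {t: n / total_size for t, n in topic_sizes.items()}
# Sparse: drop topics below a (hypothetical) size threshold, equal weight otherwise.
MIN_TOPIC_SIZE = 5
kept = [t for t, n in topic_sizes.items() if n >= MIN_TOPIC_SIZE]
sparse_weights = {t: 1 / len(kept) for t in kept}

sigma_equal = system_locality(equal_weights)
sigma_weighted = system_locality(weighted_weights)
sigma_sparse = system_locality(sparse_weights)
print(round(sigma_equal, 3), round(sigma_weighted, 3), round(sigma_sparse, 3))
```

The Weighted and Sparse values agree with the figure to three decimals. The Equal value can differ from the figure in the third decimal, depending on whether the per-topic values are rounded before averaging.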


4.  Statistic-based  locality 


We  now  turn  to  an  alternative  method  for  measuring  content-locality  that  is  based  on 
statistical  properties  of  document  collections.  In  Viles  (1996)  we  showed  that  when  collection 
statistics  differ  from  that  defined  by  the  global  corpus,  effectiveness  can  suffer.  The  method  we 
give  here  quantifies  this  difference.  The  well  known  inverse  document  frequency  (idf)  term 
weighting  factor  is  often  calculated  as 


$$\mathrm{idf}_k = \log\frac{N}{\mathrm{df}_k}$$

for some term k. In a distributed system, each site s has its own version of statistics derived from the local corpus:

$$\mathrm{idf}_{s,k} = \log\frac{N_s}{\mathrm{df}_{s,k}}.$$
If we define a centralized oracle Cen that has knowledge of all term statistics at all sites, then we can define the difference in idf for some term k between a site s and the oracle as

$$\delta_{s,k} = \frac{\left|\mathrm{idf}_{s,k} - \mathrm{idf}_{Cen,k}\right|}{\log(N)} \tag{2}$$

where we assume term k is found in some document at both sites¹. The statistical locality of term k at site s is $\delta_{s,k}$.

The denominator of Eq. (2) is a normalization factor that scales the quantity between 0 and 1. To obtain the content locality at any site s, we sum the locality measures for every term present in the local collection, $C_s$, and divide by the number of such terms:

$$\delta_s = \frac{1}{K'} \sum_{k \in (C_s \cap C_{Cen})} \delta_{s,k}$$

where $K'$ is the number of unique terms in $C_s \cap C_{Cen}$. The overall system locality $\delta$ is then the average of the locality at each site.

¹ If k does not exist at s, then we ignore this term in the locality calculation. If k is absent from s, then no document contains it at s. Therefore any query containing k will not match any of these documents on k even if those documents were located at the Oracle. So k makes no contribution to the similarity calculation for that document and should be ignored.
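The statistic-based measure lends itself to a short sketch. The Python below is illustrative and rests on assumptions that should be stated plainly: document frequencies are supplied as plain dictionaries, the centralized oracle's frequencies are obtained by summing the per-site frequencies (valid when no document is held by two sites), natural logarithms are used, and the two toy sites and their terms are invented.

```python
from math import log

def site_statistical_locality(local_df, n_local, central_df, n_central):
    """delta_s: mean normalized idf difference over terms present at the site.

    local_df / central_df map a term to its document frequency.
    n_local / n_central are the document counts N_s and N (N must exceed 1).
    """
    total = 0.0
    shared = 0
    for term, df_local in local_df.items():
        df_central = central_df.get(term, 0)
        if df_local <= 0 or df_central <= 0:
            continue                      # term absent: ignored, as in the footnote
        idf_local = log(n_local / df_local)
        idf_central = log(n_central / df_central)
        total += abs(idf_local - idf_central) / log(n_central)
        shared += 1
    return total / shared if shared else 0.0

# Two hypothetical sites of four documents each.
site_docs = {"A": 4, "B": 4}
site_df = {
    "A": {"cache": 2, "proof": 1},
    "B": {"cache": 2, "proof": 4},
}

# Oracle statistics: sum document counts and document frequencies over sites.
n_central = sum(site_docs.values())
central_df = {}
for df in site_df.values():
    for term, count in df.items():
        central_df[term] = central_df.get(term, 0) + count

delta = {
    s: site_statistical_locality(site_df[s], site_docs[s], central_df, n_central)
    for s in site_docs
}
system_delta = sum(delta.values()) / len(delta)   # average over sites
print(delta, system_delta)
```

In this toy data the term "cache" is equally common at both sites and contributes nothing, while "proof" is rare at site A and ubiquitous at site B, so both sites deviate from the oracle and site A deviates more.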


5.  Topic-based  locality  in  the  TREC  collection 

5.1.  Data  decomposition 

To  measure  topic-based  locality  we  used  substantive  subsets  of  the  TREC  data  (Harman, 
1995;  Voorhees  &  Harman,  1996).  The  TREC  data  comes  from  multiple  sources,  consisting  of 
documents  from  AP  Newswire  (1988-1990),  Wall  Street  Journal  (1987-1992),  Computer  Select, 
Federal Register (1988 and 1989), San Jose Mercury News (1991), abstracts from DOE
publications  and  US  Patents  (1993).  Several  sources  cover  multiple  years  or  time  periods. 


Table 3
Representation of the five topic sets (1-50, 51-100, 101-150, 151-200, 201-250) among the 17 document sets of the TREC data. The nature of the TREC experiments means that not all document sets contribute to each topic set. Document counts taken from Callan et al. (1995)

Name            Documents    Topic sets represented (of 5)
AP 88           79,919       5
AP 89           84,678       4
AP 90           78,321       3
DOE             226,087      4
Fed. Reg. 88    19,860       5
Fed. Reg. 89    25,960       4
Patent          6,711        3
SJMN 91         90,257       3
WSJ 87          46,448       4
WSJ 88          39,904       4
WSJ 89          12,380       4
WSJ 90          21,705       5
WSJ 91          52,652       5
WSJ 92          10,163       5
ZIFF 1          75,180       4
ZIFF 2          56,920       5
ZIFF 3          161,021      3
X 

Table 4
Measurements of topic-based content-locality for the five topic sets of the TREC collection

Topic set    Sites    $\sigma_{Weighted}$    $\sigma_{Equal}$
1-50         13       0.513                  0.489
51-100       17       0.389                  0.396
101-150      17       0.399                  0.401
151-200      13       0.443                  0.445
201-250      10       0.409                  0.439


There is as yet no generally agreed upon decomposition of the TREC data to do distributed information retrieval experiments, though several have been proposed and used (Walczuch, Fuhr, Pollmann, & Sievers, 1994; Callan et al., 1995; Voorhees, 1996; French, Powell, Viles, Emmett, & Prey, 1998) with a general trend of decomposition into more and more sites. One natural way to consider the TREC data as a distributed collection is to make each source and year a site. This is the method that is used in work reported by Callan et al. (1995) and Voorhees, Gupta, and Johnson-Laird (1995) on the ‘collection fusion’ problem and is the data decomposition we used in the experiments reported here². This set of candidate sites is described in Table 3.


5.2.  Topic  identification 

The major hurdle in the calculation of content-locality is topic identification. The method we use for the TREC corpus is to treat each query as a topic. The documents relevant to that query are considered to be the members of the topic. For the large collections in particular, this leaves a large number of documents that do not belong to an identified topic because they are not relevant to any of the queries provided. This is somewhat unsatisfying, since the disposition of these documents may have an effect on the actual content-locality of a particular collection. However, we can consider these documents as members of unidentified topics which, if known, would have been handled as the known topics were. The identified topics are then considered representative of the universe of possible topics and conclusions drawn from the accompanying results are valid. This kind of assumption has long been made in experimental IR work. The possibility of bias in the set of queries is one reason multiple collections are used in IR experimentation.

As  we  have  mentioned,  we  used  the  group  of  topics  provided  with  the  TREC  collection  and 
the  set  of  accompanying  relevant  documents  to  identify  the  members  of  each  topic.  However, 
because of the nature of the TREC experiments, not all of the subcollections identified in Table 3 have relevance judgements for all of the five 50-member TREC topic sets (numbered 1-50, ..., 201-250) we used in this study. For example, the topic members for topic set 201-250 have been identified for only 10 sites. When we calculate locality for any particular topic
set,  we  can  use  only  the  subcollections  for  which  the  topic  members  have  been  identified. 
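The topic identification step described above amounts to grouping relevant documents by topic and by site. A minimal sketch follows; the tuple layout of the judgments, the document identifiers, and the document-to-site mapping are all hypothetical and are not the TREC file formats.

```python
from collections import defaultdict

def topic_site_counts(judgments, doc_site):
    """Build n_{s,t}: for each topic, the number of relevant documents per site.

    judgments is an iterable of (topic_id, doc_id, is_relevant) tuples.
    doc_site maps a doc_id to the site (source and year) that holds it.
    Documents judged non-relevant, or held by no known site, are skipped.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for topic_id, doc_id, is_relevant in judgments:
        if not is_relevant:
            continue
        site = doc_site.get(doc_id)
        if site is None:
            continue
        counts[topic_id][site] += 1
    return {topic: dict(per_site) for topic, per_site in counts.items()}

# Invented identifiers, for illustration only.
doc_site = {"d1": "AP 88", "d2": "AP 88", "d3": "WSJ 90", "d4": "WSJ 90"}
judgments = [
    (161, "d1", True),
    (161, "d3", True),
    (161, "d4", False),   # judged, but not relevant
    (37, "d2", True),
    (37, "d9", True),     # relevant, but not held by any listed site
]

n_st = topic_site_counts(judgments, doc_site)
print(n_st)
```

The resulting per-topic, per-site counts are the $n_{s,t}$ values that the topical locality measure takes as input; sites without judgments for a topic set would simply be left out of `doc_site` for that calculation.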


² Collection fusion is the process of merging results from searches performed on different collections.

5.3. Results

Because  we  have  five  sets  of  topics,  we  generated  five  measurements  of  topic-based  system 
locality.  These  measurements  appear  in  Table  4.  Since  they  are  single  measurements,  it  is  hard 
to  assess  whether  the  difference  between  topic  sets  is  significant.  However,  the  absolute 
differences  are  small. 

In Table 4 we also give the unweighted locality measure, $\sigma_{Equal}$. These match closely with $\sigma_{Weighted}$, indicating that there is relatively little ‘small topic effect’ in these measurements. This
observation  is  also  supported  by  a  scatterplot  plotting  locality  against  topic  size  (Fig.  3).  In 
this  plot,  we  show  five  sets  of  TREC  topics  using  different  symbols,  something  which  TREC- 
initiated  readers  may  find  helpful.  Otherwise,  the  plot  can  be  interpreted  as  a  simple  scatterplot 
and  the  difference  between  symbols  ignored.  Regardless,  there  appears  to  be  little  correlation 
between  locality  and  topic  size. 


6.  Statistic-based  locality  in  NCSTRL 

The Networked Computer Science Technical Report Library (http://www.ncstrl.org) is a distributed collection of technical reports from over 100 academic sites doing research in Computer Science. It has been operational since July of 1995 and currently services thousands of queries per day. Here we measure the content locality of this operational, distributed document collection using the statistically based measure, $\delta$.

The data we used for this analysis was obtained at two different times, late July of 1995 from a beta-version of NCSTRL and early February 1998. Gross characteristics of the collection at these two times are given in Table 5.

Table 5
Characteristics of the NCSTRL collection at two times

                                       July 1995    February 1998
Number of sites                        29           102
Number of ‘large’ sites (>30 docs)     18           73
Number of docs                         8450         21,357
Number of docs at large sites          8397         21,158

Table 6
Statistic-based content-locality of the NCSTRL collection at two times, July 1995 and February 1998. At top is the locality of the 18 sites that were members of NCSTRL at both times. Summary locality measurements for February 1998 are given for both the 18 ‘original’ NCSTRL sites as well as for the entire NCSTRL archive. The sites are arranged in order of increasing locality of the July 1995 measurement

                   July 1995                       February 1998
Site               size    locality ($\delta_s$)   size    locality ($\delta_s$)
MIT                2342    0.070                   2348    0.094
Cornell            1412    0.089                   1532    0.109
Stanford           758     0.104                   1230    0.114
Cal-Irvine         477     0.114                   564     0.124
Wisconsin          521     0.118                   643     0.130
Virginia Tech      420     0.120                   480     0.132
Cal-Berkeley       968     0.121                   1107    0.127
Hong Kong          30      0.125                   30      0.124
Virginia           309     0.131                   371     0.115
Princeton          188     0.134                   274     0.161
Auburn             86      0.134                   86      0.137
Maryland           293     0.136                   595     0.143
Chicago            137     0.137                   276     0.137
SUNY Buffalo       120     0.146                   188     0.141
Old Dominion       93      0.147                   183     0.152
Boston U.          48      0.156                   107     0.185
UNC-Chapel Hill    95      0.159                   155     0.170
Iowa State         100     0.164                   123     0.180

System locality ($\delta$)    July 1995    February 1998 (n = 18)    February 1998 (n = 73)
Average                       0.128        0.138                     0.162
Standard deviation            0.024        0.024                     0.050
CV (percent)                  18.8         17.4                      30.6

Fig. 3. Topic size versus locality in the 250 TREC topics.

Fig. 4. Site size versus statistical locality in NCSTRL at two time periods, July 1995 (18 sites) and February 1998 (73 sites).

Each  document  in  NCSTRL  is  a  bibliographic  record  that  includes  at  a  minimum,  author, 
date  and  title.  A  large  percentage  of  the  records  contain  abstracts  and  on-line  full  text  in  a 
variety  of  formats  (OCR,  postscript,  page  images).  We  make  the  normal  IR  assumption  that 
the  available  terms  in  a  text  adequately  express  the  topical  content  of  the  document  it 
represents  (Salton  &  McGill,  1983).  For  our  locality  measurements,  we  considered  only  the  text 
contained  in  the  title  and  abstract.  This  text  was  preprocessed  by  removing  common  words  and 
stripping  words  to  their  stems.  Using  the  S  measure,  we  then  compared  the  statistical  signature 
of  each  site  against  a  central  collection  composed  of  the  entire  bibliographic  collection. 
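
The preprocessing step can be sketched roughly as follows. This is only an illustration: the stop list, the crude suffix stripper (standing in for a real stemmer) and the function names are assumptions, not the procedure actually used on NCSTRL.

```python
import re
from collections import Counter

# Hypothetical stop list; the paper does not say which list was used.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

# Crude suffix stripping standing in for a real stemmer (an assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_counts(title, abstract):
    """Return stem frequencies for one record's title and abstract."""
    tokens = re.findall(r"[a-z]+", f"{title} {abstract}".lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)


counts = term_counts(
    "Measuring content locality",
    "We measure the locality of distributed collections.",
)
print(counts.most_common(3))
```

Summing such per-record counts over a site would give one plausible form of the site's term statistics; the comparison against the central collection is not reproduced here.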

Results  of  this  operation  are  given  in  Table  6  along  with  the  collection  size  of  each  site. 
There  is  a  rough,  inverse  correlation  between  the  size  of  the  site  and  the  locality  of  that  site, 
which  is  depicted  graphically  in  Fig.  4.  This  is  to  be  expected.  The  older,  well-established 
departments  have  larger  collections  and  tend  to  be  more  representative  of  the  entire  collection 
than  smaller  departments  which  tend  to  focus  on  a  small  number  of  selected  research  areas. 
For  example,  the  SUNY-Buffalo  archive  is  heavily  theory  oriented  and  the  Iowa  State  archive 
has  emphasis  in  languages  and  search  algorithms.  Fig.  4  also  shows  that  the  NCSTRL 
collection  has  grown  more  content-localized  over  time  and  that  the  site-to-site  variation  of 
measured  locality  has  also  increased. 
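
As a rough check of the inverse relationship, the sketch below computes a Pearson correlation between site size and locality using the July 1995 values for fifteen of the sites in Table 6 (Cal-Irvine through Iowa State). The choice of Pearson's coefficient and the helper name `pearson` are assumptions made for illustration; the paper itself only describes the correlation qualitatively.

```python
from math import sqrt

# (number of documents, locality) in July 1995 for fifteen Table 6 sites,
# Cal-Irvine through Iowa State.
JULY_1995 = [
    (477, 0.114), (521, 0.118), (420, 0.120), (968, 0.121), (30, 0.125),
    (309, 0.131), (188, 0.134), (86, 0.134), (293, 0.136), (137, 0.137),
    (120, 0.146), (93, 0.147), (48, 0.156), (95, 0.159), (100, 0.164),
]


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


r = pearson(JULY_1995)
print(f"Pearson r (site size vs. locality): {r:.2f}")
```

A negative coefficient is what the text leads us to expect; a rank correlation would be a reasonable alternative, since the relationship need not be linear.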

As we have noted, the content-locality for the 18 ‘original’ NCSTRL sites increased over
time. In July 1995, each site made up a larger proportion of the NCSTRL DL than it does now,
and thus we would expect it to be more representative of the entire DL. At the later time, each
site is less representative, so content-locality should be higher. This is exactly the effect we
observed.

Examination  of  Table  6  also  yields  some  insight  into  the  challenges  of  fielding  operational 
distributed  digital  libraries  (Lagoze,  Fielding,  &  Payette,  1998).  For  example,  the  Computer 
Science  Departments  at  several  sites  show  little  if  any  growth  in  their  archives  though  in  reality 
they continue to produce technical reports. This is related to software version difficulties and
technical  support  considerations  rather  than  actual  report  production.  Though  some  of  the 
original  sites  have  not  contributed  any  new  documents  to  the  DL,  their  content  locality  has 
changed.  This  is  to  be  expected,  because  content-locality  is  measured  with  respect  to  the  entire 
DL  and  the  entire  DL  has  changed  considerably  in  between  the  two  snapshots. 

We  also  note  that  the  average  measured  locality  of  NCSTRL  increased  from  0.128  to  0.162 
over  the  30  month  time  period  and  the  coefficient  of  variation  (CV)  increased  as  well,  from 
18.8  to  30.6%.  As  the  DL  grows,  it  is  becoming  more  content-localized,  not  less. 
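
For reference, the coefficient of variation is the standard deviation expressed as a percentage of the mean. Assuming that standard definition, the July 1995 figures from the system-locality summary give:

$$\mathrm{CV} = 100 \times \frac{\text{standard deviation}}{\text{average}} = 100 \times \frac{0.024}{0.128} = 18.75 \approx 18.8\%.$$

The same calculation with the rounded February 1998 values ($0.050 / 0.162$) gives about 30.9%, close to the reported 30.6%; the small difference is consistent with the average and standard deviation having been rounded to three decimal places.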


7.  Discussion 

Operational  distributed  document  collections  are  only  now  starting  to  be  deployed.  The 
usefulness  of  content-locality  monitoring  is  still  undetermined.  The  major  motivation  in  the 
context  of  this  work  is  determining  whether  or  not  member  sites  in  the  distributed  document 
collection  need  to  communicate  statistical  information  about  their  local  collections  to  other 
member  sites.  If  locality  is  low,  then  no  communication  is  needed.  If  locality  is  high,  then 
communication  is  needed  to  maintain  good  retrieval  effectiveness. 

Of course, this raises the question of what exactly is ‘low’ and ‘high’. Clearly, if $\sigma = 0.01$
then locality is low and if $\sigma = 0.98$ locality is high. However, if $\sigma = 0.40$, then in what situation
are we? There are two questions to answer. First, how ‘skewed’ is a topic that shows locality of
0.40? Second, does $\sigma = 0.40$ mean the system will show reduced search effectiveness?

To address the first question, we selected three topics from the TREC topics showing ‘low’,
‘average’ and ‘high’ locality relative to the observed locality of the entire 250 topics. The text
of these three topics appears at the top of Fig. 5 and the distribution of the topic among the
13 TREC ‘sites’ is given at the bottom of Fig. 5. The topic distribution gives some useful
intuition into what a value of $\sigma$ means operationally. The distribution of the ‘low’ locality
topic 161 is concentrated at three main sites, but is represented at all sites except one. Topic
152, the medium locality topic, is also concentrated at three sites, but is represented at fewer
sites than topic 161, so locality is higher here. At the ‘high’ locality end, topic 37 is represented
almost entirely at two sites with only token representation at three other sites. Locality is
highest for this topic.

Topic   Text Excerpt

161     Document will provide information on the problems and actions associated
        with what is known as acid rain.

152     Accusations of Cheating by Contractors on U.S. Defense Projects. Document
        will refer to an alleged illegality committed by any entity seeking a contract
        on behalf of the U.S. Military Forces.

37      Document identifies software products which adhere to IBM’s SAA standards.
        To be relevant, a document must identify a piece of software which
        is considered a Systems Application Architectural (SAA) component or one
        which conforms to SAA.

Fig. 5. Three topics from the TREC collection. Excerpts of text from the topics are at top and the percent
distribution of each topic at each site is at the bottom. Topics were chosen to reflect ‘low’ (topic 161), ‘average’
(topic 152) and ‘high’ (topic 37) locality.

Regarding the second question, it is reasonable to conclude that the NCSTRL archive taken
as a whole is a skewed collection; the wide variation in individual site localities lends credence
to this conclusion. However, we cannot definitively say that $\sigma$ of 0.138 means that sites need to
communicate with each other in order to maximize effectiveness. What is needed is systematic,
concurrent monitoring of both content-locality and user satisfaction in operational systems.
With the deployment of working systems like NCSTRL, such work is only now becoming
possible.

Fig. 6. An example of content-locality in time. This figure shows the temporal distribution of relevant documents in
the AP Newswire (1988-1990) for two TREC topics at granularities of 1 and 7 days. One topic asks about
presidential politics, the other asks about terrorist activities.

7.1.  Temporal  locality 

In  addition  to  content-locality  in  space,  the  possibility  of  locality  in  time  exists  as  well.  In 
temporally  ordered  document  collections  like  news  archives,  there  is  a  form  of  topical 
clustering  in  time  that  naturally  arises  in  topics  that  are  related  to  events  (natural  disasters, 
elections),  days  of  the  week  (church  services,  Friday  night  football  games)  and  seasons  (snow, 
fall  foliage,  summer  beach  traffic). 

Most  work  focusing  on  ad-hoc  queries  assumes  that  the  temporal  distribution  of  relevant 
documents  is  uniform.  That  is,  the  probability  of  relevance  is  independent  of  when  the 
document  was  created.  Intuitively,  this  appears  to  be  an  invalid  assumption  for  at  least  the 
kinds  of  queries  outlined  above.  In  such  cases,  knowledge  of  the  time-based  probability 
distribution  of  relevance  might  aid  considerably  in  focusing  retrieval. 

Focused  work  on  event  detection  and  tracking  using  transcriptions  of  news  broadcasts  has 
shown  that  the  explicit  use  of  temporal  features  can  help  with  retrieval.  For  example,  Allan, 
Papka,  and  Lavrenko  (1998)  explicitly  factor  in  time  when  determining  whether  a  news  story  is 
part  of  a  previously  detected  event  or  describes  a  new  event  while  Yang,  Pierce,  and  Carbonell 
(1998)  use  temporal  proximity  as  a  feature  in  event-focused  document  clustering. 

Fig.  6  provides  evidence  to  support  this  intuition.  Depicted  is  the  temporal  distribution  of 
relevance  in  a  3  year  period  of  the  AP  Newswire  for  two  TREC  topics.  In  both  cases,  the 
distribution  of  relevance  is  decidedly  non-uniform. 
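
A temporal distribution of this kind can be obtained by binning the dates of relevant documents at a chosen granularity. The sketch below shows the idea for 1-day and 7-day bins; the dates are invented for illustration and `bin_by_days` is a hypothetical helper, not code from the study.

```python
from collections import Counter
from datetime import date


def bin_by_days(dates, start, width_days):
    """Count documents per bin of `width_days` days, counted from `start`."""
    return Counter((d - start).days // width_days for d in dates)


# Hypothetical dates of relevant documents (illustrative only).
relevant = [
    date(1988, 1, 4), date(1988, 1, 5), date(1988, 1, 5),
    date(1988, 1, 9), date(1988, 3, 1),
]
start = date(1988, 1, 1)

daily = bin_by_days(relevant, start, 1)    # granularity of 1 day
weekly = bin_by_days(relevant, start, 7)   # granularity of 7 days

print(sorted(daily.items()))
print(sorted(weekly.items()))
```

Under a uniform distribution of relevance the bin counts would be roughly equal; clusters of relevant documents show up as a few heavily populated bins.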


8.  Summary 

The  notion  of  content-locality  in  distributed  document  collections  is  new.  The  work  we 
describe  in  this  paper  is  an  effort  to  more  fully  understand  the  underlying  phenomenon. 
Through  the  introduction  of  two  possible  measurement  methodologies,  we  have  made 
considerable  progress  in  this  regard.  Using  these  methods,  we  measured  the  topic  locality  of 
the TREC collection and statistical locality of an operational distributed collection, NCSTRL.

The  fidelity  of  the  topic-based  locality  measurement  rests  entirely  on  the  accuracy  of  topic 
determination.  If  good  topic  descriptions  (e.g.  subject  descriptions)  are  available,  then  locality 
is  easy  to  calculate.  Content-locality  is  only  one  of  many  reasons  that  topic  identification  is 
worth  knowing.  Others  include  improved  efficiency,  faster  retrieval  and  more  effective 
browsing.  Comparison  of  content-locality  derived  from  different  systems  should  not  be  done 
lightly.  The  method  of  topic  identification  is  the  key  to  enabling  a  reasonable  comparison.  To 
the  extent  that  topic  identification  differs  between  systems,  direct  comparison  may  become 
increasingly  meaningless. 

The statistic-based locality measure has the potential to be implemented operationally
because it is wholly automatic and easy to calculate. No topic determination is needed. Scaling
problems need to be overcome and methods to account for or remove rare-term and
small-collection artifacts need to be devised.

Since  temporal  topic  locality  exists,  there  are  a  number  of  follow-up  questions  to  pursue. 
How do we measure it? Can we gain insight from other phenomena that exhibit locality, e.g.
memory  reference  patterns?  Is  there  a  detectable  attribute  in  the  query  that  might  indicate  a 
temporal  component  to  relevance?  If  so,  then  knowledge  of  the  ‘highly  relevant’  region(s) 
might  significantly  improve  retrieval  effectiveness  for  these  queries  through  focused  search  in 
the  region(s).  What  is  the  relationship  between  ‘topic’  and  ‘event’  (Allan  et  al.,  1998)?  Are 
there  distinct  types  of  temporal  patterns  associated  with  topics,  e.g.  uni-modal,  multimodal, 
periodic?  Again,  such  knowledge  could  further  focus  search  efforts. 

Open  questions  remain  about  how  best  to  take  advantage  of  content  locality  in  distributed 
digital libraries. Locality is highly context sensitive; term distributions may show high locality
in  one  distributed  archive  and  low  locality  in  another.  What  to  do  with,  for  example,  statistical 
information  then  becomes  context  sensitive  as  well.  Sometimes  sites  may  need  to  share 
information  to  achieve  high  effectiveness  on  a  search,  while  other  times  such  sharing  may  not 
be  needed. 

Locality  in  time  and  space  is  not  a  new  concept,  being  integral  in  a  wide  variety  of  areas 
including  analysis  of  memory  reference  patterns  (Madison  &  Batson,  1976;  Weikle,  McKee,  & 
Wulf,  1998),  file  caching  in  networked  file  systems  (Satyanarayanan,  1989)  and  caching  in 
various distributed computer systems, World Wide Web servers and browsers being the most
obvious  current  example.  Careful  consideration  of  this  literature  may  provide  deeper  insight 
into  the  nature  of  content-locality  and  methods  to  measure  it. 
Appendix A. Notes on $\sigma$

A.1. Bounds on $\sigma_t$

From the definitions of $c_s$ and Table 1 we get
These constraints mean that we can put bounds on $\sigma_t$, specifically, $0 \le \sigma_t < \sqrt{2}$. The only time
$\sigma_t = 0$ is when $c_s = b_{s,t}$ for all $s$. To see that $\sigma_t < \sqrt{2}$, consider the following:
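
As a sketch, assuming $\sigma_t = \sqrt{\sum_{s=1}^{S} (b_{s,t} - c_s)^2}$ (the form used in Section A.2) with $b_{s,t} \ge 0$, $c_s \ge 0$ and $\sum_s b_{s,t} = \sum_s c_s = 1$, the bound can be argued as:

$$\sigma_t^2 = \sum_{s=1}^{S} b_{s,t}^2 - 2\sum_{s=1}^{S} b_{s,t}\,c_s + \sum_{s=1}^{S} c_s^2 \;\le\; \sum_{s=1}^{S} b_{s,t}^2 + \sum_{s=1}^{S} c_s^2 \;\le\; 1 + 1 = 2.$$

The last step uses $\sum_s x_s^2 \le \left(\sum_s x_s\right)^2 = 1$ for non-negative $x_s$. Equality would require all documents to reside at a single site while the topic resides entirely at a different, empty site, so the bound is strict whenever more than one site holds documents.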
The situations when $\sigma_t$ is close to $\sqrt{2}$ are when a topic is completely located at a site that is
very small relative to other sites. A simple example is a 2 site system where $b_{s,t} = (0, 1)$ and
$c_s = (0.9, 0.1)$. In this case we get
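
Assuming the same Euclidean form of the measure as in Section A.2, the arithmetic for this example works out as:

$$\sigma_t = \sqrt{(0 - 0.9)^2 + (1 - 0.1)^2} = \sqrt{0.81 + 0.81} = \sqrt{1.62} \approx 1.27,$$

which is 90% of $\sqrt{2} \approx 1.41$.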
However, if sites are approximately the same proportion, namely $1/S$, then we can derive a
tighter bound than $\sqrt{2}$:
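
A sketch of such a derivation, assuming $c_s = 1/S$ for every site and the Euclidean form of the measure used in Section A.2:

$$\sigma_t^2 = \sum_{s=1}^{S} \left(b_{s,t} - \frac{1}{S}\right)^2 = \sum_{s=1}^{S} b_{s,t}^2 - \frac{2}{S}\sum_{s=1}^{S} b_{s,t} + \frac{1}{S} = \sum_{s=1}^{S} b_{s,t}^2 - \frac{1}{S} \;\le\; 1 - \frac{1}{S},$$

so $\sigma_t \le \sqrt{1 - 1/S} < 1$, with equality when the topic is located entirely at one site.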
A.2. Properties of $\sigma_t$

To appreciate the behavior of $\sigma_t$, we first consider the set of systems where a topic is
uniformly distributed over $k = 1, 2, \ldots, S$ sites. The locality measure should decrease
monotonically as $k$ increases. For ease of analysis, assume that the size of each site is
constant, i.e. $\forall s,\ c_s = 1/S$. When $k = 1$,

$$\sigma_t = \sqrt{\sum_{s=1}^{S} (b_{s,t} - c_s)^2} = \sqrt{(1 - 1/S)^2 + (S - 1)(1/S)^2} = \sqrt{1 - 2/S + 1/S^2 + (S - 1)(1/S^2)} = \sqrt{1 - 1/S}.$$

When $k = 2$,

$$\sigma_t = \sqrt{2(1/2 - 1/S)^2 + (S - 2)(1/S^2)} = \sqrt{2(1/4 - 1/S + 1/S^2) + (S - 2)(1/S^2)} = \sqrt{1/2 - 1/S},$$

and for arbitrary $k$,

$$\sigma_t = \sqrt{k(1/k - 1/S)^2 + (S - k)(1/S^2)} = \sqrt{k\left(1/k^2 - (2/k)(1/S) + 1/S^2\right) + (S - k)(1/S^2)} = \sqrt{1/k - 1/S}.$$
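
The closed form above can be checked numerically. The following sketch assumes the measure is the Euclidean distance between a topic's distribution over sites and the site-size proportions, as in the derivation; the function names and the choice of 13 sites are illustrative assumptions.

```python
from math import isclose, sqrt


def topic_locality(topic_share, site_share):
    """Euclidean distance between a topic's distribution and the site sizes."""
    return sqrt(sum((b - c) ** 2 for b, c in zip(topic_share, site_share)))


def uniform_over_k(k, num_sites):
    """Locality of a topic spread evenly over k of num_sites equal-sized sites."""
    topic = [1 / k] * k + [0.0] * (num_sites - k)
    sites = [1 / num_sites] * num_sites
    return topic_locality(topic, sites)


S = 13  # illustrative choice, e.g. the 13 TREC 'sites'
values = [uniform_over_k(k, S) for k in range(1, S + 1)]

for k, value in enumerate(values, start=1):
    # max() guards against a tiny negative rounding error when k == S
    closed_form = sqrt(max(1 / k - 1 / S, 0.0))
    assert isclose(value, closed_form, abs_tol=1e-9)

print([round(v, 3) for v in values[:3]])
```

The loop confirms that the direct computation agrees with $\sqrt{1/k - 1/S}$ for every $k$, and the resulting values decrease as the topic spreads over more sites.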